The recent run-up in the US stock market from the bottom of the pandemic sell-off was nothing short of remarkable. The Federal Reserve stepped in aggressively to stem the fallout from the pandemic, and the S&P 500 more than doubled, marking its fastest rally since World War II. It felt like the market was a one-way street with no corrections. Much has changed since the S&P 500 hit all-time highs in January 2022, including inflation not seen in decades, a policy regime change by the Fed, and, certainly not least, the Russian invasion of Ukraine. In the face of these strong headwinds, the equity market has been resilient, sitting only a few percentage points below its all-time high.
I think many investors are asking themselves a simple question: is the stock market, particularly the US market, overvalued? To answer this question for myself, I want to look at historical data and use a simple statistical approach to get a better sense of the overall US stock market. The goal is to expand this analysis over time, make it more robust, and build dynamic tools around it.
I already have OHLC price data, including daily returns, in .csv format from 1950 through April 2019. The data comes from the supplementary material of the excellent book Quantitative Investing by Lingjie Ma. From the same source, I also have monthly returns data from 1918 through 1949. I will use the Yahoo Finance API to pull the remaining data through March 31st, 2022, and save it to a separate .csv file. Yahoo Finance is not always reliable, but it will work in this case.
library(tidyverse)
library(tidyquant)
library(reactable)
library(DT)
library(plotly)

# load the csv file for price data from 1950 through April 2019
old_prices <- read_csv("SP500_daily_Index_Price_2019.csv")

# fetch remaining S&P 500 prices
new_prices <- tq_get("^GSPC",
get = "stock.prices",
from = "2019-04-18",
to = "2022-04-01") %>%
# change row names from ^GSPC to S&P 500 in the symbol column for readability
mutate(symbol = replace(symbol, symbol == "^GSPC", "S&P 500"))
# save imported prices to a new csv file
write_csv(new_prices, "SP500_daily_Index_Price_2022.csv")

Looking at the first 6 rows from the two different tables, it is immediately clear that some data wrangling is required before I can actually start my analysis.
old_prices
new_prices
I am going to first make some changes to the old_prices table:
Change the column name yyyymmdd to date, and convert the dates to the international standard (YYYY-MM-DD) convention, matching the new_prices table
Change the date column data type from dbl to date
Remove the ...7 column, which is not needed and contains no values
# drop the column with NA values and rename first column to date
# limit decimal number to 5 in return column
old_prices_clean <- old_prices %>%
select(-...7) %>%
rename(date = yyyymmdd) %>%
mutate(return = round(return, 5))
# use lubridate function to change data type of the date column
old_prices_clean <-
mutate(old_prices_clean, date = ymd(date))

I now need to make the following changes to the new_prices table.
Remove the symbol column, since we are dealing with the S&P 500 only and it is therefore not needed
Remove the volume and adjusted columns; the adjusted prices are identical to the close prices for this index, so they are redundant
Add a return column and calculate daily returns
Limit return calculation to 5 decimal places to match the first table
# drop the symbol, volume, and adjusted columns
new_prices_clean <- new_prices %>%
select(-symbol, -volume, -adjusted)
# use tidyquant function to add a returns column and calculate daily returns
# rename the created daily.returns column to return
# limit decimal number to 5 in return column
new_prices_clean <-
tq_mutate(new_prices_clean, select = close, mutate_fun = periodReturn, period = "daily") %>%
rename(return = daily.returns) %>%
mutate(return = round(return, 5))

old_prices_clean
new_prices_clean
It is looking a lot better now. There is one problem, however: the first row of the new_prices_clean table shows a 0 return for 2019-04-18, since there is nothing prior in this table to calculate the return from. The old_prices_clean table already includes this date and its return, so I can simply remove this row.
The last step is to merge these two tables into one combined dataframe, and I will also save this newly created table to a csv file for future reference.
# remove the first row
new_prices_clean <- new_prices_clean[-1,]
# combine the two tables
combined_prices_clean <- bind_rows(old_prices_clean, new_prices_clean)
# save to a new csv file
write_csv(combined_prices_clean, "SP500_daily_Index_Price_Combined.csv")

It is time to start looking at some numbers, including risk and return. I will look at the data as a whole, but also split it into 3 different time periods.
First Period: 1918 - 1949
Second Period: 1950 - 2000
Third Period: 2001 - 2022
Why these periods? My thought process is that I want some granularity in the data to be able to compare different periods in history, ranging from wars to different interest rate regimes. There is no exact science here, but I feel it is good enough for this purpose.
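As a rough sketch of what that per-period comparison will look like, the snippet below tags each daily observation with its period using case_when() and then computes simple risk and return statistics per group. Note that the data here is a synthetic stand-in (the real analysis runs on combined_prices_clean, which I have not loaded in this snippet), and the 252-trading-day annualization and column names are my own assumptions.

```r
library(dplyr)
library(lubridate)

# Synthetic daily returns as a stand-in for combined_prices_clean;
# the real table spans 1918 - 2022, these numbers are made up
set.seed(42)
dates <- seq(as.Date("1918-01-01"), as.Date("2022-03-31"), by = "day")
returns <- tibble(date = dates,
                  return = rnorm(length(dates), mean = 0.0003, sd = 0.01))

period_stats <- returns %>%
  # assign each observation to one of the three historical periods
  mutate(period = case_when(
    year(date) <= 1949 ~ "1918 - 1949",
    year(date) <= 2000 ~ "1950 - 2000",
    TRUE               ~ "2001 - 2022"
  )) %>%
  group_by(period) %>%
  summarise(
    mean_daily = mean(return),
    sd_daily   = sd(return),
    ann_return = mean(return) * 252,      # rough annualization assumption
    ann_vol    = sd(return) * sqrt(252),
    .groups = "drop"
  )

period_stats
```

The same group_by()/summarise() pattern extends naturally to other statistics (skewness, drawdowns, etc.) once the real combined table is in place.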
Let us take a quick look at the S&P 500 from 1950 to today.
index <- combined_prices_clean %>%
ggplot(aes(x = date, y = close)) +
geom_line(color = palette_light()[[1]]) +
labs(title = "S&P 500 Index",
subtitle = "1950 - 2022",
y = "Closing Price", x = "") +
theme_tq() +
theme(panel.border = element_blank())
ggplotly(index)